Skip to content

USHIFT-6951: Add metrics exporters as optional MicroShift components#6763

Draft
copejon wants to merge 12 commits into
openshift:mainfrom
copejon:ushift-6951-metrics-exporters
Draft

USHIFT-6951: Add metrics exporters as optional MicroShift components#6763
copejon wants to merge 12 commits into
openshift:mainfrom
copejon:ushift-6951-metrics-exporters

Conversation

@copejon
Copy link
Copy Markdown
Contributor

@copejon copejon commented May 29, 2026

Deliver metrics-server, kube-state-metrics, and node-exporter as optional MicroShift components packaged in a single microshift-metrics RPM. When installed alongside microshift-observability, the otel-collector automatically scrapes all three exporters via drop-in config.

What's included

  • microshift-metrics RPM — kustomize manifests for metrics-server, kube-state-metrics, and node-exporter with SCC RBAC, service-ca TLS, and health check registration
  • metrics-server kubelet client cert — dedicated cert under the kube-apiserver-to-kubelet-signer provisioned as a Secret, avoiding hostPath SELinux issues
  • Auto-rebase integration — CMO clone + manifest copy recipe in handle_assets.py, plus update_metrics_images() in rebase.sh to populate arch-specific overlays and release JSON from the OCP payload
  • Otel-collector drop-inmicroshift-metrics.yaml scrape configs for all three exporters using Kubernetes endpoints SD with service-CA TLS

Made with Cursor

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optional metrics-server component for Kubernetes cluster metrics
    • Added kube-state-metrics for monitoring Kubernetes resource states
    • Added node-exporter for host and node-level metrics
    • Integrated OpenTelemetry Collector for metrics collection and export
  • Chores

    • Added RPM packages for metrics components (x86_64 and aarch64)
    • Added RHEL 10 support in development environment tools
    • Updated deployment manifests and Kubernetes RBAC configurations for metrics workloads

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 29, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 29, 2026

@copejon: This pull request references USHIFT-6951 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Deliver metrics-server, kube-state-metrics, and node-exporter as optional MicroShift components packaged in a single microshift-metrics RPM. When installed alongside microshift-observability, the otel-collector automatically scrapes all three exporters via drop-in config.

What's included

  • microshift-metrics RPM — kustomize manifests for metrics-server, kube-state-metrics, and node-exporter with SCC RBAC, service-ca TLS, and health check registration
  • metrics-server kubelet client cert — dedicated cert under the kube-apiserver-to-kubelet-signer provisioned as a Secret, avoiding hostPath SELinux issues
  • Auto-rebase integration — CMO clone + manifest copy recipe in handle_assets.py, plus update_metrics_images() in rebase.sh to populate arch-specific overlays and release JSON from the OCP payload
  • Otel-collector drop-inmicroshift-metrics.yaml scrape configs for all three exporters using Kubernetes endpoints SD with service-CA TLS

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 240fac84-8424-42e1-9105-110e8f97cbb4

📥 Commits

Reviewing files that changed from the base of the PR and between e53c5d4 and 765807b.

📒 Files selected for processing (4)
  • packaging/observability/otelcol.d/microshift-metrics-kube-state-metrics.yaml
  • packaging/observability/otelcol.d/microshift-metrics-node-exporter.yaml
  • packaging/observability/otelcol.d/microshift-metrics-server.yaml
  • packaging/rpm/microshift.spec

Walkthrough

This PR integrates three optional metrics collection components (metrics-server, kube-state-metrics, node-exporter) into MicroShift via Kubernetes manifests, RPM packaging, and automated build/deployment infrastructure. The implementation includes certificate provisioning, multi-architecture image overrides, and OpenTelemetry collector integration.

Changes

Metrics Components

Layer / File(s) Summary
Metrics component manifests and kustomization
assets/optional/metrics-server/*, assets/optional/kube-state-metrics/*, assets/optional/node-exporter/*
Kubernetes manifests define three metrics exporters with service accounts, RBAC roles/bindings, deployments/daemonsets, services, and supporting ConfigMaps. Kustomization files provide base overlays and per-architecture (aarch64/x86_64) image remappings with SHA256 digests.
Certificate provisioning and TLS material
assets/components/metrics-server/*, pkg/cmd/init.go, pkg/cmd/metrics.go, pkg/cmd/run.go, pkg/util/cryptomaterial/certinfo.go
Metrics-server kubelet client certificates are provisioned at startup; CSR configuration in init.go defines the system:metrics-server identity, provisionMetricsServerCerts() polls for the namespace and applies TLS secrets/ConfigMaps, and a background goroutine in RunMicroshift coordinates provisioning.
RPM subpackage definitions and installation
packaging/rpm/microshift.spec
Four new RPM subpackages package the three metrics components and release metadata; %install creates OTEL drop-in directories, installs per-arch kustomization manifests and release JSON, and OTEL collector drop-in YAMLs marked %config(noreplace).
Build and rebase automation for metrics manifests
scripts/auto-rebase/assets_metrics.yaml, scripts/auto-rebase/rebase.sh
Auto-rebase script clones cluster-monitoring-operator and copies metrics assets via handle_assets.py, regenerates per-arch release metadata JSON, and rewrites kustomization image overrides from OCP release digests.
Runtime integration and observability
packaging/observability/microshift-observability.service, packaging/observability/otelcol.d/*, pkg/healthcheck/microshift_optional_workloads.go
Systemd service now loads per-component OTel drop-in YAMLs; three new drop-in configs define Prometheus receivers for metrics-server, kube-state-metrics, and node-exporter. Health checks recognize the new optional workloads and merge namespace expectations across multiple manifest paths.
Development and test infrastructure
scripts/devenv-builder/*, test/bin/common.sh
VM creation scripts detect RHEL 10+ and enable UEFI boot; ISO filename helper recognizes RHEL 10 versions; test allowlist adds metrics RPM packages.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • pacevedom
  • ggiguash

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 2 warnings)

Check name Status Explanation Resolution
Container-Privileges ❌ Error Node-exporter has justified privileged: true and hostPID/hostNetwork for monitoring, but kube-state-metrics containers lack allowPrivilegeEscalation: false explicitly set, defaulting to unsafe. Add allowPrivilegeEscalation: false to all kube-state-metrics containers. Node-exporter privileges are acceptable given its system monitoring requirements.
Docstring Coverage ⚠️ Warning Docstring coverage is 47.06% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Topology-Aware Scheduling Compatibility ⚠️ Warning metrics-server has toleration for master nodes, and node-exporter has wildcard operator:Exists toleration. Both risk scheduling to arbiter nodes on TNA topology. Exclude arbiter nodes via nodeAffinity or label selectors. Replace node-exporter's operator:Exists with specific tolerations that exclude node-role.kubernetes.io/arbiter:NoSchedule.
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main changeset: adding three metrics exporters (metrics-server, kube-state-metrics, node-exporter) as optional MicroShift components.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR adds no Ginkgo tests. Test files in this codebase use Go's standard testing package (testing.T), not Ginkgo, so test name stability check is not applicable.
Test Structure And Quality ✅ Passed No Ginkgo test files (*_test.go) were added or modified in this PR. The custom check for Ginkgo test structure is not applicable.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added in this PR. The PR only modifies regular source code files and configuration manifests for adding optional metrics exporters.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests (It/Describe/Context/When) were added in this PR. Changes are infrastructure manifests, Go source code, packaging, and scripts—not e2e tests.
Ote Binary Stdout Contract ✅ Passed This PR modifies operational MicroShift service code (metrics components) and manifests, not OTE/test extension binaries. The OTE stdout contract check does not apply.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests—only Kubernetes manifests, Go functions, scripts, and RPM specs. IPv6/disconnected network check not applicable.
No-Weak-Crypto ✅ Passed No weak cryptography found. All TLS uses strong ciphers (ECDHE+AES-GCM/ChaCha20-Poly1305), standard Go crypto libraries, no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, and no unsafe comparisons.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data logging found. All logging statements log only non-sensitive items: file paths, namespace names, component names, generic error messages, and manifest paths.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from agullon and pacevedom May 29, 2026 00:32
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

♻️ Duplicate comments (1)
assets/optional/metrics-server/release-metrics-aarch64.json (1)

3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Verify placeholder replacement for release base and image digests.

All placeholder values ("base": "placeholder" and sha256:placeholder digests) must be replaced before deployment. Confirm that update_metrics_images() populates these values from the OCP payload during the build process.

Also applies to: 6-8

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/metrics-server/release-metrics-aarch64.json` at line 3, The
JSON contains placeholder values ("base": "placeholder" and image digests like
sha256:placeholder) that must be populated from the OCP payload; update the
update_metrics_images() function so it reads the OCP payload during the build,
replaces the "base": "placeholder" field and any sha256:placeholder digests in
release-metrics-aarch64.json (and the other affected files) with the actual
values, and fail the build with a clear error if any placeholders remain after
processing to prevent deployment with placeholders.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@assets/optional/kube-state-metrics/02-clusterrole.yaml`:
- Around line 6-22: The ClusterRole currently grants cluster-wide list/watch on
"secrets" in the resources block; remove "secrets" from that resources list in
the ClusterRole manifest (or move it into a separate opt-in ClusterRole/Role
bound only where secret metrics are explicitly enabled) so the
kube-state-metrics workload no longer has cluster-wide secret read access;
update any docs/values that toggle secret metrics to clearly opt-in if you
reintroduce a separate role.

In `@assets/optional/kube-state-metrics/04-deployment.yaml`:
- Around line 106-111: The manifest pins kube-state-metrics to control-plane
nodes via the nodeSelector (kubernetes.io/os: linux +
node-role.kubernetes.io/master) and a matching toleration; remove the
nodeSelector and the tolerations block from the Deployment so the workload is
not forced onto master/control-plane nodes (or if a softer preference is
required, replace with a preferredDuringSchedulingIgnoredDuringExecution
nodeAffinity targeting non-control-plane nodes instead of a hard
nodeSelector/toleration).
- Around line 38-41: The securityContext blocks (securityContext in the
container spec) are missing explicit capability drops; update each container's
securityContext to include capabilities.drop: ["ALL"] and then add back only any
strictly required capabilities via capabilities.add if needed (i.e., modify the
existing securityContext entries where allowPrivilegeEscalation,
readOnlyRootFilesystem, runAsNonRoot are set); apply the same change to the
other container securityContext occurrences in this file so all containers
explicitly drop all capabilities and only re-add minimal required ones.
- Around line 25-105: The pod is missing liveness/readiness probes and resource
limits for all three containers (kube-state-metrics, kube-rbac-proxy-main,
kube-rbac-proxy-self); for each container add a resources.limits stanza (e.g.
kube-state-metrics limits cpu/memory higher than requests; kube-rbac-proxy-main
and kube-rbac-proxy-self likewise) and add readinessProbe and livenessProbe
entries: for kube-state-metrics use an HTTP GET to port 8081 path /metrics (or
/healthz if preferred) and for kube-rbac-proxy-main/self use HTTPS HTTPGet
probes pointing to their secure-listen ports (8443 and 9443) with appropriate
scheme: HTTPS and path /readyz or /healthz, plus reasonable
initialDelaySeconds/periodSeconds/timeoutSeconds values; place these probes and
limits under the corresponding container blocks (identified by name:
kube-state-metrics, kube-rbac-proxy-main, kube-rbac-proxy-self).

In `@assets/optional/metrics-server/00-namespace.yaml`:
- Around line 5-9: The namespace-level pod security labels
pod-security.kubernetes.io/enforce, pod-security.kubernetes.io/audit, and
pod-security.kubernetes.io/warn are set to privileged and must be removed from
the namespace manifest to avoid opt-out of baseline/restricted controls; delete
those three labels from the namespace YAML and instead apply the exception only
to the specific workload by adding the equivalent labels/annotations on the
Deployment/DaemonSet/Pod manifest(s) for the metrics server or bind a targeted
ServiceAccount to a custom/restricted SCC (or cluster-scoped
SecurityContextConstraints) so only that workload runs with elevated privileges
rather than the entire openshift-monitoring namespace.

In `@assets/optional/metrics-server/05-deployment.yaml`:
- Around line 91-96: The manifest currently pins the pod to control-plane nodes
via the nodeSelector and tolerations entries (nodeSelector, tolerations, key:
node-role.kubernetes.io/master); remove these fields so the Deployment no longer
hard-requires scheduling on master/control-plane nodes, or replace them with a
non-mandatory soft constraint (e.g.,
preferredDuringSchedulingIgnoredDuringExecution style affinity) if you need a
preference—ensure no direct reference to node-role.kubernetes.io/master or
control-plane remains in the Deployment spec.
- Around line 57-60: The container spec currently only sets resource requests
(resources.requests.cpu and resources.requests.memory); add corresponding
resource limits by adding a resources.limits block with both cpu and memory
(e.g., limits.cpu and limits.memory) for the same container entry so the pod
cannot overconsume under pressure—update the resources section in the Deployment
container spec (the resources field under the container definition in this YAML)
to include limits for cpu and memory matching your expected upper bounds.
- Around line 61-63: The manifest's securityContext currently sets
allowPrivilegeEscalation and readOnlyRootFilesystem but omits runAsNonRoot and
capability drops; update the securityContext block (the same object containing
allowPrivilegeEscalation and readOnlyRootFilesystem) to set runAsNonRoot: true
and add capabilities: drop: ["ALL"] (and only add back any specific capabilities
if the container truly needs them), ensuring the hardening is enforced in the
deployment spec.

In `@assets/optional/node-exporter/02-clusterrole.yaml`:
- Around line 16-22: The ClusterRole currently grants access to the built-in
privileged SCC by listing resourceNames: - privileged for
securitycontextconstraints; replace this by creating and referencing a custom,
minimal SCC (e.g., node-exporter-scc) that only allows the specific host
namespaces and capabilities the DaemonSet requires, then update the
ClusterRole/RoleBinding to reference resourceNames: - node-exporter-scc (or the
custom SCC name) instead of privileged; ensure the ServiceAccount used by
node-exporter is the subject of the binding and that the custom SCC limits
capabilities, hostPID, hostNetwork, hostPath and any other host access to the
minimum necessary.

In `@assets/optional/node-exporter/04-daemonset.yaml`:
- Around line 25-26: The DaemonSet currently sets hostPID: true which
unnecessarily broadens the pod's privileges; remove the hostPID: true entry from
the Pod spec (spec.template.spec.hostPID) so the pod does not join the host PID
namespace unless a collector explicitly requires it, and verify the exporter
still functions using /host/root and /host/sys mounts; if a collector truly
requires host PID, add a short inline comment next to hostPID and a
justification in the PR describing why the exception is necessary and document
which collector needs it.
- Around line 27-96: Add resource limits and liveness/readiness probes for both
containers: for container "node-exporter" (name node-exporter) add
resources.limits for cpu and memory (e.g., match requests ratio like cpu: 50m,
memory: 64Mi) and add a readinessProbe and livenessProbe performing an HTTP GET
to path /metrics on port 9100 (scheme HTTP) with sensible timing
(initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold); for
container "kube-rbac-proxy" (name kube-rbac-proxy) add resources.limits (e.g.,
cpu: 50m, memory: 64Mi) and readinessProbe/livenessProbe using HTTP GET to path
/metrics on port 9101 but with scheme HTTPS (and tls settings honored) and
similar timing values; ensure these probe blocks are added under each container
and adjust values to match cluster SLOs.
- Around line 58-84: Update the kube-rbac-proxy container's securityContext to
harden the sidecar: add allowPrivilegeEscalation: false, set runAsNonRoot: true
and an explicit non-root runAsUser (choose a non-zero UID), and drop all Linux
capabilities (capabilities.drop: ["ALL"] and only add back any minimal
capability if proven necessary); keep readOnlyRootFilesystem: true and do not
change other mounts. Target the container named "kube-rbac-proxy" in the
DaemonSet spec when applying these changes.

In `@assets/optional/node-exporter/kustomization.aarch64.yaml`:
- Around line 1-7: Replace the placeholder digest values for the two images in
kustomization.aarch64.yaml: update the digest fields under the image entries for
"quay.io/openshift/node-exporter" and "quay.io/openshift/kube-rbac-proxy" with
the actual sha256 digests that correspond to the referenced newName images
(registry.redhat.io/openshift4/ose-prometheus-node-exporter-rhel9 and
registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9); obtain the exact
digests from your image registry (e.g., skopeo inspect or the registry API/CI
output) and commit those real sha256:... values in place of "sha256:placeholder"
so the aarch64 overlay pins to real images.

In `@packaging/observability/otelcol.d/microshift-metrics.yaml`:
- Around line 10-27: The HTTPS scrapes discovered via
kubernetes_sd_configs.role: endpoints will use pod IPs for TLS ServerName and
fail hostname verification; update the TLS config for the
prometheus/kube_state_metrics scrape (job_name: kube-state-metrics) and likewise
for prometheus/metrics_server to include tls_config.server_name set to the
service FQDN (for example kube-state-metrics.openshift-monitoring.svc) so
certificate hostname verification uses the service DNS name instead of the pod
IP; add tls_config.server_name alongside the existing ca_file entry under
prometheus/kube_state_metrics (and the equivalent block for
prometheus/metrics_server).

In `@pkg/healthcheck/microshift_optional_workloads.go`:
- Around line 42-53: The three manifest entries for openshift-monitoring are
clobbering each other because fillOptionalMicroShiftWorkloads() keeps a single
NamespaceWorkloads per namespace; update the initialization so entries that
target the same Namespace ("openshift-monitoring") are merged rather than
overwritten: when building the map of NamespaceWorkloads (type
NamespaceWorkloads) combine Deployments and DaemonSets arrays (e.g., merge
"metrics-server" and "kube-state-metrics" into Deployments and "node-exporter"
into DaemonSets) before assigning to the namespace key in
fillOptionalMicroShiftWorkloads(), ensuring duplicates are avoided.

In `@scripts/auto-rebase/rebase.sh`:
- Around line 1136-1159: The METRICS_JSON_MAP associative array currently misses
the kube-rbac-proxy entry, so update METRICS_JSON_MAP (declared as
METRICS_JSON_MAP=( ... )) to include a mapping for the kube-rbac-proxy image
(e.g., add ["kube_rbac_proxy"]="kube-rbac-proxy" or a consistent key name used
elsewhere), so the for-loop that iterates json_key/release_tag/new_image will
pick up and write kube-rbac-proxy into the generated
release-metrics-${arch}.json files the same way as
metrics_server/kube_state_metrics/node_exporter.

In `@scripts/devenv-builder/create-vm.sh`:
- Around line 58-63: The RHEL major detection in create-vm.sh uses grep -P with
a pattern that can fail under set -euo when VMNAME contains just "-10" (no dot),
causing the script to exit and BOOT_OPTS to never be set; change the logic to
robustly derive RHEL_MAJOR (prefer MICROSHIFT_RHEL_VERSION if present, otherwise
extract digits from VMNAME) and avoid grep failing on no-match (e.g., use grep
-oE '\d+' with a fallback like "|| true" or use pure shell parameter expansion
to strip the suffix and parse digits), then compute RHEL_MAJOR=${RHEL_MAJOR%%.*}
or cut -d. -f1 and keep the existing BOOT_OPTS assignment using that value.

---

Duplicate comments:
In `@assets/optional/metrics-server/release-metrics-aarch64.json`:
- Line 3: The JSON contains placeholder values ("base": "placeholder" and image
digests like sha256:placeholder) that must be populated from the OCP payload;
update the update_metrics_images() function so it reads the OCP payload during
the build, replaces the "base": "placeholder" field and any sha256:placeholder
digests in release-metrics-aarch64.json (and the other affected files) with the
actual values, and fail the build with a clear error if any placeholders remain
after processing to prevent deployment with placeholders.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6e479cdc-b516-4f90-bdff-ac9801aa36ee

📥 Commits

Reviewing files that changed from the base of the PR and between a0c1df6 and 25ea1dc.

📒 Files selected for processing (45)
  • assets/components/metrics-server/kubelet-ca-configmap.yaml
  • assets/components/metrics-server/kubelet-client-secret.yaml
  • assets/optional/kube-state-metrics/01-serviceaccount.yaml
  • assets/optional/kube-state-metrics/02-clusterrole.yaml
  • assets/optional/kube-state-metrics/03-clusterrolebinding.yaml
  • assets/optional/kube-state-metrics/04-deployment.yaml
  • assets/optional/kube-state-metrics/05-service.yaml
  • assets/optional/kube-state-metrics/kustomization.aarch64.yaml
  • assets/optional/kube-state-metrics/kustomization.x86_64.yaml
  • assets/optional/kube-state-metrics/kustomization.yaml
  • assets/optional/metrics-server/00-namespace.yaml
  • assets/optional/metrics-server/01-serviceaccount.yaml
  • assets/optional/metrics-server/02-clusterrole.yaml
  • assets/optional/metrics-server/03-clusterrolebinding.yaml
  • assets/optional/metrics-server/04-rolebinding.yaml
  • assets/optional/metrics-server/05-deployment.yaml
  • assets/optional/metrics-server/06-service.yaml
  • assets/optional/metrics-server/07-apiservice.yaml
  • assets/optional/metrics-server/kustomization.aarch64.yaml
  • assets/optional/metrics-server/kustomization.x86_64.yaml
  • assets/optional/metrics-server/kustomization.yaml
  • assets/optional/metrics-server/release-metrics-aarch64.json
  • assets/optional/metrics-server/release-metrics-x86_64.json
  • assets/optional/node-exporter/01-serviceaccount.yaml
  • assets/optional/node-exporter/02-clusterrole.yaml
  • assets/optional/node-exporter/03-clusterrolebinding.yaml
  • assets/optional/node-exporter/04-daemonset.yaml
  • assets/optional/node-exporter/05-service.yaml
  • assets/optional/node-exporter/kustomization.aarch64.yaml
  • assets/optional/node-exporter/kustomization.x86_64.yaml
  • assets/optional/node-exporter/kustomization.yaml
  • packaging/observability/microshift-observability.service
  • packaging/observability/otelcol.d/microshift-metrics.yaml
  • packaging/rpm/microshift.spec
  • pkg/cmd/init.go
  • pkg/cmd/metrics.go
  • pkg/cmd/run.go
  • pkg/healthcheck/microshift_optional_workloads.go
  • pkg/util/cryptomaterial/certinfo.go
  • scripts/auto-rebase/assets_metrics.yaml
  • scripts/auto-rebase/rebase.sh
  • scripts/devenv-builder/config/kickstart.ks.template
  • scripts/devenv-builder/create-vm.sh
  • scripts/devenv-builder/manage-vm.sh
  • test/bin/common.sh

Comment thread assets/optional/kube-state-metrics/02-clusterrole.yaml Outdated
Comment on lines +25 to +105
containers:
- name: kube-state-metrics
image: quay.io/openshift/kube-state-metrics:latest
imagePullPolicy: IfNotPresent
args:
- --host=127.0.0.1
- --port=8081
- --telemetry-host=127.0.0.1
- --telemetry-port=8082
resources:
requests:
cpu: 10m
memory: 64Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
- name: kube-rbac-proxy-main
image: quay.io/openshift/kube-rbac-proxy:latest
imagePullPolicy: IfNotPresent
args:
- --secure-listen-address=:8443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- --tls-min-version=VersionTLS12
- --upstream=http://127.0.0.1:8081/
- --tls-cert-file=/etc/tls/private/tls.crt
- --tls-private-key-file=/etc/tls/private/tls.key
ports:
- containerPort: 8443
name: https-main
protocol: TCP
resources:
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
volumeMounts:
- name: metrics-tls
mountPath: /etc/tls/private
readOnly: true
- name: tmp
mountPath: /tmp
- name: kube-rbac-proxy-self
image: quay.io/openshift/kube-rbac-proxy:latest
imagePullPolicy: IfNotPresent
args:
- --secure-listen-address=:9443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- --tls-min-version=VersionTLS12
- --upstream=http://127.0.0.1:8082/
- --tls-cert-file=/etc/tls/private/tls.crt
- --tls-private-key-file=/etc/tls/private/tls.key
ports:
- containerPort: 9443
name: https-self
protocol: TCP
resources:
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
volumeMounts:
- name: metrics-tls
mountPath: /etc/tls/private
readOnly: true
- name: tmp-self
mountPath: /tmp
volumes:
- name: metrics-tls
secret:
secretName: kube-state-metrics-tls
- name: tmp
emptyDir: {}
- name: tmp-self
emptyDir: {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add probes and limits for all three containers.

This pod defines requests only, and none of the containers have liveness/readiness probes. That leaves the rollout without health gating and the workload without hard resource ceilings.

As per coding guidelines, "Resource limits (cpu, memory) on every container" and "Liveness + readiness probes defined".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/kube-state-metrics/04-deployment.yaml` around lines 25 - 105,
The pod is missing liveness/readiness probes and resource limits for all three
containers (kube-state-metrics, kube-rbac-proxy-main, kube-rbac-proxy-self); for
each container add a resources.limits stanza (e.g. kube-state-metrics limits
cpu/memory higher than requests; kube-rbac-proxy-main and kube-rbac-proxy-self
likewise) and add readinessProbe and livenessProbe entries: for
kube-state-metrics use an HTTP GET to port 8081 path /metrics (or /healthz if
preferred) and for kube-rbac-proxy-main/self use HTTPS HTTPGet probes pointing
to their secure-listen ports (8443 and 9443) with appropriate scheme: HTTPS and
path /readyz or /healthz, plus reasonable
initialDelaySeconds/periodSeconds/timeoutSeconds values; place these probes and
limits under the corresponding container blocks (identified by name:
kube-state-metrics, kube-rbac-proxy-main, kube-rbac-proxy-self).

Comment thread assets/optional/kube-state-metrics/04-deployment.yaml Outdated
Comment thread assets/optional/kube-state-metrics/04-deployment.yaml Outdated
Comment on lines +5 to +9
labels:
name: openshift-monitoring
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/warn: privileged
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid namespace-wide privileged pod security labels.

Setting enforce, audit, and warn to privileged makes the whole namespace opt out of baseline/restricted controls. Scope the exception to the specific workload/SCC instead of weakening every pod in openshift-monitoring.

As per coding guidelines, **/*.{yaml,yml}: OpenShift: SCC must be restricted or custom-scoped.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/metrics-server/00-namespace.yaml` around lines 5 - 9, The
namespace-level pod security labels pod-security.kubernetes.io/enforce,
pod-security.kubernetes.io/audit, and pod-security.kubernetes.io/warn are set to
privileged and must be removed from the namespace manifest to avoid opt-out of
baseline/restricted controls; delete those three labels from the namespace YAML
and instead apply the exception only to the specific workload by adding the
equivalent labels/annotations on the Deployment/DaemonSet/Pod manifest(s) for
the metrics server or bind a targeted ServiceAccount to a custom/restricted SCC
(or cluster-scoped SecurityContextConstraints) so only that workload runs with
elevated privileges rather than the entire openshift-monitoring namespace.

Comment on lines +1 to +7
images:
- name: quay.io/openshift/node-exporter
newName: registry.redhat.io/openshift4/ose-prometheus-node-exporter-rhel9
digest: sha256:placeholder
- name: quay.io/openshift/kube-rbac-proxy
newName: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9
digest: sha256:placeholder
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Replace the placeholder digests before merge.

sha256:placeholder makes the aarch64 overlay unusable and breaks image pinning for that architecture. If automation is expected to rewrite this, the generated digests need to be committed here, like the x86_64 overlay already does.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/node-exporter/kustomization.aarch64.yaml` around lines 1 - 7,
Replace the placeholder digest values for the two images in
kustomization.aarch64.yaml: update the digest fields under the image entries for
"quay.io/openshift/node-exporter" and "quay.io/openshift/kube-rbac-proxy" with
the actual sha256 digests that correspond to the referenced newName images
(registry.redhat.io/openshift4/ose-prometheus-node-exporter-rhel9 and
registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9); obtain the exact
digests from your image registry (e.g., skopeo inspect or the registry API/CI
output) and commit those real sha256:... values in place of "sha256:placeholder"
so the aarch64 overlay pins to real images.

Comment on lines +10 to +27
prometheus/kube_state_metrics:
config:
scrape_configs:
- job_name: kube-state-metrics
scrape_interval: 30s
scheme: https
tls_config:
ca_file: /var/lib/microshift/certs/service-ca/ca.crt
kubernetes_sd_configs:
- kubeconfig_file: /var/lib/microshift/resources/observability-client/kubeconfig
role: endpoints
namespaces:
names: [openshift-monitoring]
relabel_configs:
- source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-state-metrics;https-main

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="packaging/observability/otelcol.d/microshift-metrics.yaml"

# Show relevant sections around the referenced line ranges
echo "=== File: $file (lines 1-120) ==="
nl -ba "$file" | sed -n '1,120p'

echo "=== Search for tls_config/server_name and __address__ relabeling in the file ==="
rg -n "server_name|tls_config:|__address__|relabel_configs" "$file" || true

echo "=== Show more context around kube_state_metrics and metrics_server if present ==="
rg -n "kube-state-metrics|metrics-server" -n "$file" || true

Repository: openshift/microshift

Length of output: 211


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="packaging/observability/otelcol.d/microshift-metrics.yaml"

echo "File exists? $(test -f "$file" && echo yes || echo no)"

echo "=== First 200 lines (numbered) of $file ==="
cat -n "$file" | sed -n '1,200p'

echo "=== Search for tls_config/server_name and address relabeling ==="
rg -n "server_name|tls_config:|__address__|relabel_configs" "$file" || true

echo "=== Lines matching kube-state-metrics or metrics-server ==="
rg -n "kube-state-metrics|metrics-server" "$file" -n || true

echo "=== Show around lines 1-120 (non-numbered) for context ==="
sed -n '1,120p' "$file"

Repository: openshift/microshift

Length of output: 4965


🌐 Web query:

Prometheus scrape_config tls_config server_name default behavior hostname verification when server_name not set

💡 Result:

In Prometheus scrape_configs.tls_config, server_name is the TLS “ServerName” used for both (1) certificate hostname verification and (2) sending an SNI value in the client handshake. If you do not set tls_config.server_name, Prometheus will not send an explicit ServerName based on that field; instead, it relies on the underlying Go TLS behavior. 1) When tls_config.server_name is set - Prometheus passes server_name into Go’s tls.Config as Config.ServerName, which Go uses to verify the hostname on the returned certificates (unless InsecureSkipVerify is true) and to include in the client handshake for virtual hosting (SNI), unless the name is an IP address. [1] - Prometheus’ own docs describe server_name as “ServerName extension to indicate the name of the server.” [2] 2) Default behavior when tls_config.server_name is NOT set - Prometheus leaves TLSConfig.ServerName unset/empty. - In Go’s crypto/tls client dial logic, if Config.ServerName is empty, it infers ServerName from the hostname portion of the address being dialed (the “hostname we’re connecting to”). [3] So, “default hostname verification” is effectively performed against the hostname part of the target address (e.g., the host in host:port), rather than being driven by tls_config.server_name. 3) Practical implication for TLS/SNI in Prometheus scraping - If your target’s certificate is issued to a DNS name (e.g., original-sni.example.com) but the Prometheus scrape target address is an IP or some other host name, certificate verification and/or SNI can fail unless you set tls_config.server_name to the expected DNS name. This issue is commonly observed in mTLS setups where curl with an explicit SNI works, but the equivalent Prometheus config fails when server_name doesn’t match what the server expects for SNI. [4] Key takeaways - tls_config.server_name missing => Go derives TLS ServerName from the target’s dial hostname, and that value is used for certificate hostname verification (unless insecure_skip_verify is true). [3][1] - tls_config.server_name set => that value is used for hostname verification and SNI. [1]

Citations:


🌐 Web query:

OpenShift service-ca CA cert SANs service DNS does it include pod IPs

💡 Result:

OpenShift service-ca (service serving certificate) does not include pod IPs in the CA/service-serving certificate SANs. What SANs it includes (DNS): the generated service serving certificates are “only valid for the internal service DNS name …svc” [1][2]. If the service is headless (no clusterIP set), the certificate also contains a wildcard subject in the format of “*…svc” [1][2]. So, for service-ca-issued service serving certs: - SANs include internal service DNS names (…svc, and for headless services also a wildcard …svc) [1][2]. - SANs do not include pod IPs (the docs describe validity scoped to service DNS/internal communications, not pod IP addressing) [1][2]. Note: separate from this, service-ca can inject the service CA bundle (service-ca.crt) into ConfigMaps/other resources; that is about trust distribution, not SAN contents of service serving certs [1][2].

Citations:


Set tls_config.server_name for endpoint-discovered HTTPS scrapes (avoid pod-IP TLS hostname verification failures)

File: packaging/observability/otelcol.d/microshift-metrics.yaml (10-45)

Both prometheus/kube_state_metrics and prometheus/metrics_server use kubernetes_sd_configs.role: endpoints, so the HTTPS targets will be the discovered endpoint addresses (typically pod IPs). With only ca_file set, Prometheus will use the target’s dial host (pod IP) as the TLS ServerName for hostname verification, while service-serving certs are issued for service DNS names (…​.svc) and not pod IPs—so these scrapes are likely to fail TLS hostname verification.

Suggested fix
   prometheus/kube_state_metrics:
     config:
       scrape_configs:
         - job_name: kube-state-metrics
           scrape_interval: 30s
           scheme: https
           tls_config:
             ca_file: /var/lib/microshift/certs/service-ca/ca.crt
+            server_name: kube-state-metrics.openshift-monitoring.svc
           kubernetes_sd_configs:
             - kubeconfig_file: /var/lib/microshift/resources/observability-client/kubeconfig
               role: endpoints
               namespaces:
                 names: [openshift-monitoring]

   prometheus/metrics_server:
     config:
       scrape_configs:
         - job_name: metrics-server
           scrape_interval: 30s
           scheme: https
           tls_config:
             ca_file: /var/lib/microshift/certs/service-ca/ca.crt
+            server_name: metrics-server.openshift-monitoring.svc
           kubernetes_sd_configs:
             - kubeconfig_file: /var/lib/microshift/resources/observability-client/kubeconfig
               role: endpoints
               namespaces:
                 names: [openshift-monitoring]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packaging/observability/otelcol.d/microshift-metrics.yaml` around lines 10 -
27, The HTTPS scrapes discovered via kubernetes_sd_configs.role: endpoints will
use pod IPs for TLS ServerName and fail hostname verification; update the TLS
config for the prometheus/kube_state_metrics scrape (job_name:
kube-state-metrics) and likewise for prometheus/metrics_server to include
tls_config.server_name set to the service FQDN (for example
kube-state-metrics.openshift-monitoring.svc) so certificate hostname
verification uses the service DNS name instead of the pod IP; add
tls_config.server_name alongside the existing ca_file entry under
prometheus/kube_state_metrics (and the equivalent block for
prometheus/metrics_server).

Comment thread pkg/healthcheck/microshift_optional_workloads.go
Comment on lines +1136 to +1159
# Maps release JSON key -> OCP release tag name
declare -A METRICS_JSON_MAP=(
["metrics_server"]="metrics-server"
["kube_state_metrics"]="kube-state-metrics"
["node_exporter"]="prometheus-node-exporter"
)

for goarch in amd64 arm64; do
arch=${GOARCH_TO_UNAME_MAP["${goarch}"]:-noarch}

local release_file="${STAGING_DIR}/release_${goarch}.json"
local metrics_release_json="${REPOROOT}/assets/optional/metrics-server/release-metrics-${arch}.json"

local base_release
base_release=$(jq -r ".metadata.version" "${release_file}")
jq -n "{\"release\": {\"base\": \"$base_release\"}, \"images\": {}}" > "${metrics_release_json}"

# Update release-metrics-${arch}.json
for json_key in "${!METRICS_JSON_MAP[@]}"; do
local release_tag="${METRICS_JSON_MAP[$json_key]}"
local new_image
new_image=$(jq -r ".references.spec.tags[] | select(.name == \"${release_tag}\") | .from.name" "${release_file}")
yq -i -o json ".images += {\"${json_key}\": \"${new_image}\"}" "${metrics_release_json}"
done
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include kube-rbac-proxy in release-metrics-*.json.

METRICS_JSON_MAP omits kube-rbac-proxy, even though the kustomization rebasing logic handles it. That leaves microshift-metrics-release-info incomplete for disconnected/blueprint flows that rely on the generated image list.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/auto-rebase/rebase.sh` around lines 1136 - 1159, The METRICS_JSON_MAP
associative array currently misses the kube-rbac-proxy entry, so update
METRICS_JSON_MAP (declared as METRICS_JSON_MAP=( ... )) to include a mapping for
the kube-rbac-proxy image (e.g., add ["kube_rbac_proxy"]="kube-rbac-proxy" or a
consistent key name used elsewhere), so the for-loop that iterates
json_key/release_tag/new_image will pick up and write kube-rbac-proxy into the
generated release-metrics-${arch}.json files the same way as
metrics_server/kube_state_metrics/node_exporter.

Comment on lines +58 to +63
# RHEL 10+ requires UEFI boot
BOOT_OPTS=""
RHEL_MAJOR=$(echo "${VMNAME}" | grep -oP '(\d+)\.\d+' | cut -d. -f1)
if [ "${RHEL_MAJOR:-0}" -ge 10 ]; then
BOOT_OPTS="--boot uefi"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Default VM name format:"
sed -n '40,44p' scripts/devenv-builder/manage-vm.sh

echo
echo "Current RHEL major detection:"
sed -n '58,63p' scripts/devenv-builder/create-vm.sh

echo
echo "Regex result for a default RHEL 10 VM name:"
printf '%s\n' 'microshift-user-host-10' | grep -oP '(\d+)\.\d+' || echo 'NO_MATCH'

Repository: openshift/microshift

Length of output: 470


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== create-vm.sh context (lines ~50-80) ==="
nl -ba scripts/devenv-builder/create-vm.sh | sed -n '45,85p'

echo
echo "=== manage-vm.sh context (lines ~30-70) ==="
nl -ba scripts/devenv-builder/manage-vm.sh | sed -n '30,80p'

echo
echo "=== Bash set -e + pipefail + failing grep in command substitution test ==="
cat > /tmp/sete_cmdsub_test.sh <<'BASH'
#!/usr/bin/env bash
set -euo pipefail

VMNAME="microshift-user-host-10"

RHEL_MAJOR=$(echo "${VMNAME}" | grep -oP '(\d+)\.\d+' | cut -d. -f1)
echo "RHEL_MAJOR=${RHEL_MAJOR}"
BASH
chmod +x /tmp/sete_cmdsub_test.sh
/tmp/sete_cmdsub_test.sh || echo "Script exited non-zero (as expected if set -e aborts)"

Repository: openshift/microshift

Length of output: 169


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== create-vm.sh context (lines 45-85) ==="
sed -n '45,85p' scripts/devenv-builder/create-vm.sh | cat -n

echo
echo "=== manage-vm.sh context (lines 30-80) ==="
sed -n '30,80p' scripts/devenv-builder/manage-vm.sh | cat -n

echo
echo "=== Bash set -e + pipefail + failing grep in command substitution test ==="
cat > /tmp/sete_cmdsub_test.sh <<'BASH'
#!/usr/bin/env bash
set -euo pipefail

VMNAME="microshift-user-host-10"

RHEL_MAJOR=$(echo "${VMNAME}" | grep -oP '(\d+)\.\d+' | cut -d. -f1)
echo "RHEL_MAJOR=${RHEL_MAJOR}"
BASH
chmod +x /tmp/sete_cmdsub_test.sh
/tmp/sete_cmdsub_test.sh || echo "Script exited non-zero"

Repository: openshift/microshift

Length of output: 3349


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== create-vm.sh lines 70-80 ==="
sed -n '70,80p' scripts/devenv-builder/create-vm.sh | cat -n

Repository: openshift/microshift

Length of output: 589


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== create-vm.sh header (first 40 lines) ==="
sed -n '1,40p' scripts/devenv-builder/create-vm.sh | cat -n

echo
echo "=== Find get_vm_name calls in manage-vm.sh ==="
rg -n "get_vm_name" scripts/devenv-builder/manage-vm.sh

echo
echo "=== Show manage-vm.sh around VM name generation ==="
rg -n "VMNAME|get_vm_name|rhel_version" scripts/devenv-builder/manage-vm.sh -n -S
sed -n '1,160p' scripts/devenv-builder/manage-vm.sh | cat -n | sed -n '1,160p'

Repository: openshift/microshift

Length of output: 10398


🏁 Script executed:

#!/bin/bash
sed -n '74p' scripts/devenv-builder/create-vm.sh | cat -A

Repository: openshift/microshift

Length of output: 84


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# exact line numbers for the problematic block
rg -n "RHEL_MAJOR=|BOOT_OPTS=|RHEL 10\+ requires UEFI boot" scripts/devenv-builder/create-vm.sh

# show the full block with exact line numbers
awk 'NR>=0{print}' scripts/devenv-builder/create-vm.sh | awk 'NR>=1 && NR<=120' >/dev/null

Repository: openshift/microshift

Length of output: 216


Fix RHEL boot detection to not parse ${VMNAME} with (\d+)\.\d+ under set -euo pipefail.

manage-vm.sh can set MICROSHIFT_RHEL_VERSION=10, producing VMNAME=...-10. In scripts/devenv-builder/create-vm.sh (lines 58-62), RHEL_MAJOR=$(... | grep -oP '(\d+)\.\d+' ...) then finds no match, and with set -e the script exits before virt-install, so BOOT_OPTS never becomes --boot uefi.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/devenv-builder/create-vm.sh` around lines 58 - 63, The RHEL major
detection in create-vm.sh uses grep -P with a pattern that can fail under set
-euo when VMNAME contains just "-10" (no dot), causing the script to exit and
BOOT_OPTS to never be set; change the logic to robustly derive RHEL_MAJOR
(prefer MICROSHIFT_RHEL_VERSION if present, otherwise extract digits from
VMNAME) and avoid grep failing on no-match (e.g., use grep -oE '\d+' with a
fallback like "|| true" or use pure shell parameter expansion to strip the
suffix and parse digits), then compute RHEL_MAJOR=${RHEL_MAJOR%%.*} or cut -d.
-f1 and keep the existing BOOT_OPTS assignment using that value.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

@copejon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-tests-bootc-el9 25ea1dc link true /test e2e-aws-tests-bootc-el9
ci/prow/e2e-aws-tests-bootc-arm-el9 25ea1dc link true /test e2e-aws-tests-bootc-arm-el9
ci/prow/e2e-aws-tests-bootc-periodic-el9 25ea1dc link true /test e2e-aws-tests-bootc-periodic-el9
ci/prow/e2e-aws-tests-arm 25ea1dc link true /test e2e-aws-tests-arm
ci/prow/e2e-aws-tests-bootc-periodic-arm-el10 25ea1dc link true /test e2e-aws-tests-bootc-periodic-arm-el10
ci/prow/e2e-aws-tests-periodic 25ea1dc link true /test e2e-aws-tests-periodic
ci/prow/e2e-aws-tests-bootc-periodic-el10 25ea1dc link true /test e2e-aws-tests-bootc-periodic-el10
ci/prow/e2e-aws-tests 25ea1dc link true /test e2e-aws-tests
ci/prow/e2e-aws-tests-bootc-el10 25ea1dc link true /test e2e-aws-tests-bootc-el10
ci/prow/test-rebase 25ea1dc link false /test test-rebase
ci/prow/verify 25ea1dc link true /test verify
ci/prow/e2e-aws-tests-periodic-arm 25ea1dc link true /test e2e-aws-tests-periodic-arm
ci/prow/e2e-aws-tests-bootc-arm-el10 25ea1dc link true /test e2e-aws-tests-bootc-arm-el10
ci/prow/e2e-aws-tests-bootc-periodic-arm-el9 25ea1dc link true /test e2e-aws-tests-bootc-periodic-arm-el9

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@copejon copejon marked this pull request as draft May 29, 2026 04:33
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 29, 2026
copejon and others added 11 commits June 3, 2026 15:08
Add kustomize manifests for metrics-server, kube-state-metrics, and
node-exporter as optional components delivered via a single
microshift-metrics RPM. Includes SCC RBAC grants (hostmount-anyuid
for metrics-server, privileged for node-exporter), service-ca TLS
integration, health check registration, and release-info packaging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hostPath cert volumes with Kubernetes Secrets/ConfigMaps to
work under enforcing SELinux (container_t cannot read container_var_lib_t
files). Generate a dedicated metrics-server-kubelet-client cert under
the kube-apiserver-to-kubelet-signer and provision it into the
openshift-monitoring namespace before kustomize applies the manifests.

- Add metrics-server-kubelet-client cert to kube-apiserver-to-kubelet-signer chain
- Add provisionMetricsServerCerts() to create Secret/ConfigMap from cert on disk
- Replace hostPath volumes with Secret/ConfigMap references in deployment
- Remove SCC use rule (no longer needs hostmount-anyuid)
- Add system:metrics-server User subject to ClusterRoleBinding for kubelet auth
- Prefer InternalIP over Hostname for kubelet address resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add drop-in config infrastructure so the otel-collector can scrape
metrics exporters installed by the microshift-metrics RPM. The service
file dynamically loads YAML configs from otelcol.d/ and the collector
merges them with the base config.

Move metrics-server cert provisioning to run asynchronously after
kustomize, waiting for the namespace to exist via PollUntilContextTimeout
before creating the kubelet client Secret and CA ConfigMap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CMO (cluster-monitoring-operator) to the rebase infrastructure so
metrics exporter manifests stay in sync with upstream during future
rebases. CMO is cloned via a new OPTIONAL_COMPONENTS list and its
manifests are copied using a handle_assets.py recipe.

RBAC files (serviceaccount, clusterrole, clusterrolebinding, service)
are taken directly from CMO. Deployments and namespace use git_restore
to preserve MicroShift's single-node adaptations while surfacing
upstream changes for human review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the microshift-metrics otel-collector drop-in to scrape
kube-state-metrics and metrics-server in addition to node-exporter.
Both new receivers use Kubernetes endpoints service discovery with
service-CA TLS, targeting the openshift-monitoring namespace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add update_metrics_images() to rebase.sh to populate architecture-
specific kustomization overlays and release-metrics JSON files from
the OCP release payload. Covers metrics-server, kube-state-metrics,
node-exporter, and kube-rbac-proxy images for both x86_64 and aarch64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…O upstream

Rename all manifests from numeric prefixes to CMO-matching names for
both metrics-server and node-exporter. Add missing CMO files verbatim
(configmap-audit-profiles, network-policy, PDB, SCC, kube-rbac-proxy
secret, accelerators configmap). Fix metrics-server deployment image
placeholder from empty string to quay.io/openshift/metrics-server so
the kustomize images transformer can match and inject the OCP digest.
Replace git_restore with ignore + reason in the rebase recipe since
they are functionally identical under no_clean and ignore communicates
why MicroShift diverges from CMO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add type-grouped numeric prefixes to metrics-server manifest files
to enforce correct deployment ordering via kustomize. Grouping:
00-namespace, 01-RBAC/SA, 02-configmap, 03-workload, 04-service.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename KSM manifests from old sequential scheme (01-serviceaccount
through 05-service) to type-grouped CMO filenames with prefixes:
01-RBAC/SA, 02-configmap/secret, 03-workload, 04-service. Add two
new CMO files: kube-rbac-proxy-secret and custom-resource-state
configmap. Update spec install loop for new naming pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add type-grouped numeric prefixes to node-exporter manifest files:
01-RBAC/SA/SCC, 02-configmap/secret, 03-workload, 04-service.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multiple metrics exporters deploy to the openshift-monitoring namespace.
The previous code overwrote the workloads map entry on each iteration,
so only the last component's workloads were checked. Add mergeWorkloads()
to append workloads from all components sharing a namespace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@copejon copejon force-pushed the ushift-6951-metrics-exporters branch from 5843fac to e53c5d4 Compare June 3, 2026 20:08
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@assets/optional/kube-state-metrics/01-cluster-role.yaml`:
- Around line 12-30: Remove cluster-wide secrets access from the ClusterRole by
deleting the "secrets" entry from the resources list in the Role/ClusterRole
definition (the block that currently lists configmaps, secrets, nodes, pods,
services, serviceaccounts, ...). Edit the resources array so it no longer
includes "secrets" (leaving other resources intact) to enforce least-privilege
RBAC for the kube-state-metrics exporter.

In `@assets/optional/kube-state-metrics/02-custom-resource-state-configmap.yaml`:
- Around line 417-470: The HELP texts for the metrics
verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed_cpu and
verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed_memory
are incorrect — they say "Minimum" but the metric exposes maxAllowed; update
both help strings to refer to "Maximum" (e.g., "Maximum cpu resources..." and
"Maximum memory resources...") so the Prometheus series are labeled correctly;
locate the help fields adjacent to those metric names and change the wording
accordingly.

In `@assets/optional/kube-state-metrics/03-deployment.yaml`:
- Around line 64-68: Update the pod-level securityContext and each container's
securityContext in the Deployment manifest (the pod spec and the three container
blocks shown around the empty securityContext entries) to enforce non-root and
hardened settings: set pod.securityContext.runAsNonRoot: true and a non-root
runAsUser (e.g., 1000) if not already set; for each container set
securityContext.runAsNonRoot: true, runAsUser consistent with the pod,
allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, privileged:
false, and capabilities.drop: ["ALL"] (and add only any minimal capabilities
required by that specific container); also add a seccompProfile: { type:
RuntimeDefault } at the pod or container level. Ensure these changes are applied
to the three container blocks and the pod-level securityContext referred to in
the diff.
- Around line 35-126: Add liveness and readiness probes plus CPU/memory limits
for each container: for the kube-state-metrics container (name:
kube-state-metrics) add an HTTPGet liveness and readiness probe targeting the
exporter on localhost:8081 (or the telemetry port 8082 if appropriate) with
sensible initialDelay/period/timeout, and add resources.limits.cpu and
resources.limits.memory alongside the existing requests; for the
kube-rbac-proxy-main and kube-rbac-proxy-self containers (names:
kube-rbac-proxy-main and kube-rbac-proxy-self) add liveness/readiness probes
(HTTPGet or TCP) pointing at their container ports 8443 and 9443 respectively
(use /healthz or /readyz or /metrics as appropriate) with sensible timing
values, and add matching resources.limits.cpu and resources.limits.memory for
each container to bound resource usage. Ensure probe fields and limits are added
under each container spec where args, image, name, and resources exist.

In `@assets/optional/node-exporter/03-daemonset.yaml`:
- Line 129: The DaemonSet sets hostPID: true which unnecessarily exposes the
host PID namespace; remove the hostPID: true line from the DaemonSet spec to
follow the guideline "No hostPID, hostNetwork, hostIPC, privileged: true", and
verify the exporter still works using the existing mounts (/host/root/proc and
/host/sys); only reintroduce hostPID if a specific collector explicitly requires
host PID visibility and document that requirement next to the hostPID setting.
- Around line 144-146: The init container "init-textfile" is unnecessarily
running privileged and as root; update its securityContext to remove privileged:
true and runAsUser: 0, and instead set runAsNonRoot: true, runAsUser to a
non-root UID (e.g., 1000) or remove runAsUser so the image defaults to non-root,
set readOnlyRootFilesystem: true, and allowPrivilegeEscalation: false; ensure no
hostPID/hostNetwork/hostIPC are enabled for that container and retain only the
minimal mounts (the emptyDir textfile and the read-only /var/log/wtmp) so the
init container is least-privileged.
- Around line 65-69: The node-exporter container currently has an empty
securityContext and no resource limits; update the container spec for
"node-exporter" to explicitly set securityContext: runAsNonRoot: true,
readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, and drop all
capabilities (capabilities: { drop: ["ALL"] }) then add back only the minimal
required capability if needed (e.g., NET_RAW) via add. Also add resource limits
to match requests (set resources.limits.cpu and resources.limits.memory
alongside the existing resources.requests.cpu and resources.requests.memory).
Ensure these changes are applied to the container entry named "node-exporter" in
the DaemonSet manifest.
- Around line 88-99: The kube-rbac-proxy daemonset currently sets
--secure-listen-address=[$(IP)]:9100 (and uses env name IP) while running with
hostNetwork: true, which binds to the node IP and is not reachable by the
collector that scrapes 127.0.0.1:9100; change the proxy to bind to loopback
instead or make the collector target the node IP. Concretely, in the DaemonSet
arguments for kube-rbac-proxy (the --secure-listen-address entry and the env
name: IP usage) replace binding to [$ (IP)] with 127.0.0.1:9100 (or remove the
envRef usage) so the proxy listens on localhost, or alternatively update the
collector scrape target to use the pod/node IP; apply the same fix for the
duplicated instance referenced around the second occurrence (lines noted in the
comment).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 26026edc-818b-432f-8470-dab3e234aa07

📥 Commits

Reviewing files that changed from the base of the PR and between 25ea1dc and 5843fac.

📒 Files selected for processing (34)
  • assets/components/metrics-server/kubelet-ca-configmap.yaml
  • assets/components/metrics-server/kubelet-client-secret.yaml
  • assets/optional/kube-state-metrics/01-cluster-role-binding.yaml
  • assets/optional/kube-state-metrics/01-cluster-role.yaml
  • assets/optional/kube-state-metrics/01-service-account.yaml
  • assets/optional/kube-state-metrics/02-custom-resource-state-configmap.yaml
  • assets/optional/kube-state-metrics/02-kube-rbac-proxy-secret.yaml
  • assets/optional/kube-state-metrics/03-deployment.yaml
  • assets/optional/kube-state-metrics/04-service.yaml
  • assets/optional/kube-state-metrics/kustomization.yaml
  • assets/optional/metrics-server/01-cluster-role-binding-auth-delegator.yaml
  • assets/optional/metrics-server/01-cluster-role-binding.yaml
  • assets/optional/metrics-server/01-cluster-role.yaml
  • assets/optional/metrics-server/01-role-binding-auth-reader.yaml
  • assets/optional/metrics-server/01-service-account.yaml
  • assets/optional/metrics-server/02-configmap-audit-profiles.yaml
  • assets/optional/metrics-server/03-deployment.yaml
  • assets/optional/metrics-server/04-api-service.yaml
  • assets/optional/metrics-server/04-service.yaml
  • assets/optional/metrics-server/kustomization.yaml
  • assets/optional/metrics-server/network-policy-downstream.yaml
  • assets/optional/metrics-server/pod-disruption-budget.yaml
  • assets/optional/node-exporter/01-cluster-role-binding.yaml
  • assets/optional/node-exporter/01-cluster-role.yaml
  • assets/optional/node-exporter/01-security-context-constraints.yaml
  • assets/optional/node-exporter/01-service-account.yaml
  • assets/optional/node-exporter/02-accelerators-collector-configmap.yaml
  • assets/optional/node-exporter/02-kube-rbac-proxy-secret.yaml
  • assets/optional/node-exporter/03-daemonset.yaml
  • assets/optional/node-exporter/04-service.yaml
  • assets/optional/node-exporter/kustomization.yaml
  • packaging/rpm/microshift.spec
  • pkg/healthcheck/microshift_optional_workloads.go
  • scripts/auto-rebase/assets_metrics.yaml
✅ Files skipped from review due to trivial changes (12)
  • assets/optional/kube-state-metrics/01-cluster-role-binding.yaml
  • assets/optional/metrics-server/01-cluster-role-binding.yaml
  • assets/optional/metrics-server/01-cluster-role.yaml
  • assets/optional/metrics-server/network-policy-downstream.yaml
  • assets/optional/metrics-server/01-service-account.yaml
  • assets/optional/metrics-server/04-service.yaml
  • assets/components/metrics-server/kubelet-client-secret.yaml
  • assets/optional/node-exporter/01-security-context-constraints.yaml
  • assets/optional/node-exporter/01-service-account.yaml
  • assets/optional/kube-state-metrics/01-service-account.yaml
  • assets/optional/metrics-server/03-deployment.yaml
  • assets/optional/kube-state-metrics/04-service.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • assets/optional/kube-state-metrics/kustomization.yaml
  • assets/optional/metrics-server/kustomization.yaml
  • packaging/rpm/microshift.spec

Comment on lines +12 to +30
- apiGroups:
- ""
resources:
- configmaps
- secrets
- nodes
- pods
- services
- serviceaccounts
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop cluster-wide secrets access from this role.

Lines 15-16 let this exporter list/watch every Secret in the cluster. That exposes full secret payloads if the pod or service account is compromised, which is a much larger blast radius than the rest of this read-only metrics role needs.

Suggested fix
 - apiGroups:
   - ""
   resources:
   - configmaps
-  - secrets
   - nodes
   - pods
   - services
As per coding guidelines, `RBAC: least privilege; no cluster-admin for workloads`.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- apiGroups:
- ""
resources:
- configmaps
- secrets
- nodes
- pods
- services
- serviceaccounts
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
- nodes
- pods
- services
- serviceaccounts
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/kube-state-metrics/01-cluster-role.yaml` around lines 12 -
30, Remove cluster-wide secrets access from the ClusterRole by deleting the
"secrets" entry from the resources list in the Role/ClusterRole definition (the
block that currently lists configmaps, secrets, nodes, pods, services,
serviceaccounts, ...). Edit the resources array so it no longer includes
"secrets" (leaving other resources intact) to enforce least-privilege RBAC for
the kube-state-metrics exporter.

Comment on lines +417 to +470
- "commonLabels":
"resource": "cpu"
"unit": "cores"
"each":
"gauge":
"labelsFromPath":
"container":
- "containerName"
"path":
- "spec"
- "resourcePolicy"
- "containerPolicies"
"valueFrom":
- "maxAllowed"
- "cpu"
"type": "Gauge"
"help": "Minimum cpu resources the VerticalPodAutoscaler can set for containers matching the name."
"labelsFromPath":
"namespace":
- "metadata"
- "namespace"
"target_api_version":
- "spec"
- "targetRef"
- "apiVersion"
"target_kind":
- "spec"
- "targetRef"
- "kind"
"target_name":
- "spec"
- "targetRef"
- "name"
"verticalpodautoscaler":
- "metadata"
- "name"
"name": "verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed_cpu"
- "commonLabels":
"resource": "memory"
"unit": "bytes"
"each":
"gauge":
"labelsFromPath":
"container":
- "containerName"
"path":
- "spec"
- "resourcePolicy"
- "containerPolicies"
"valueFrom":
- "maxAllowed"
- "memory"
"type": "Gauge"
"help": "Minimum memory resources the VerticalPodAutoscaler can set for containers matching the name."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the maxAllowed HELP text.

Lines 433 and 470 still say "Minimum" even though these metrics expose maxAllowed. That mislabels the series in Prometheus output.

Suggested fix
-          "help": "Minimum cpu resources the VerticalPodAutoscaler can set for containers matching the name."
+          "help": "Maximum cpu resources the VerticalPodAutoscaler can set for containers matching the name."
...
-          "help": "Minimum memory resources the VerticalPodAutoscaler can set for containers matching the name."
+          "help": "Maximum memory resources the VerticalPodAutoscaler can set for containers matching the name."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- "commonLabels":
"resource": "cpu"
"unit": "cores"
"each":
"gauge":
"labelsFromPath":
"container":
- "containerName"
"path":
- "spec"
- "resourcePolicy"
- "containerPolicies"
"valueFrom":
- "maxAllowed"
- "cpu"
"type": "Gauge"
"help": "Minimum cpu resources the VerticalPodAutoscaler can set for containers matching the name."
"labelsFromPath":
"namespace":
- "metadata"
- "namespace"
"target_api_version":
- "spec"
- "targetRef"
- "apiVersion"
"target_kind":
- "spec"
- "targetRef"
- "kind"
"target_name":
- "spec"
- "targetRef"
- "name"
"verticalpodautoscaler":
- "metadata"
- "name"
"name": "verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed_cpu"
- "commonLabels":
"resource": "memory"
"unit": "bytes"
"each":
"gauge":
"labelsFromPath":
"container":
- "containerName"
"path":
- "spec"
- "resourcePolicy"
- "containerPolicies"
"valueFrom":
- "maxAllowed"
- "memory"
"type": "Gauge"
"help": "Minimum memory resources the VerticalPodAutoscaler can set for containers matching the name."
- "commonLabels":
"resource": "cpu"
"unit": "cores"
"each":
"gauge":
"labelsFromPath":
"container":
- "containerName"
"path":
- "spec"
- "resourcePolicy"
- "containerPolicies"
"valueFrom":
- "maxAllowed"
- "cpu"
"type": "Gauge"
"help": "Maximum cpu resources the VerticalPodAutoscaler can set for containers matching the name."
"labelsFromPath":
"namespace":
- "metadata"
- "namespace"
"target_api_version":
- "spec"
- "targetRef"
- "apiVersion"
"target_kind":
- "spec"
- "targetRef"
- "kind"
"target_name":
- "spec"
- "targetRef"
- "name"
"verticalpodautoscaler":
- "metadata"
- "name"
"name": "verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed_cpu"
- "commonLabels":
"resource": "memory"
"unit": "bytes"
"each":
"gauge":
"labelsFromPath":
"container":
- "containerName"
"path":
- "spec"
- "resourcePolicy"
- "containerPolicies"
"valueFrom":
- "maxAllowed"
- "memory"
"type": "Gauge"
"help": "Maximum memory resources the VerticalPodAutoscaler can set for containers matching the name."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/kube-state-metrics/02-custom-resource-state-configmap.yaml`
around lines 417 - 470, The HELP texts for the metrics
verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed_cpu and
verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed_memory
are incorrect — they say "Minimum" but the metric exposes maxAllowed; update
both help strings to refer to "Maximum" (e.g., "Maximum cpu resources..." and
"Maximum memory resources...") so the Prometheus series are labeled correctly;
locate the help fields adjacent to those metric names and change the wording
accordingly.

Comment on lines +35 to +126
containers:
- args:
- --host=127.0.0.1
- --port=8081
- --telemetry-host=127.0.0.1
- --telemetry-port=8082
- --custom-resource-state-config-file=/etc/kube-state-metrics/custom-resource-state-configmap.yaml
- |
--metric-denylist=
^kube_secret_labels$,
^kube_.+_annotations$,
^kube_customresource_.+_annotations_info$,
^kube_customresource_.+_labels_info$
- --metric-labels-allowlist=pods=[*],nodes=[*],namespaces=[*],persistentvolumes=[*],persistentvolumeclaims=[*],poddisruptionbudgets=[*]
- |
--metric-denylist=
^kube_.+_created$,
^kube_.+_metadata_resource_version$,
^kube_replicaset_metadata_generation$,
^kube_replicaset_status_observed_generation$,
^kube_pod_restart_policy$,
^kube_pod_init_container_status_terminated$,
^kube_pod_init_container_status_running$,
^kube_pod_container_status_terminated$,
^kube_pod_container_status_running$,
^kube_pod_completion_time$,
^kube_pod_status_scheduled$
image: quay.io/openshift/kube-state-metrics
name: kube-state-metrics
resources:
requests:
cpu: 2m
memory: 80Mi
securityContext: {}
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /tmp
name: volume-directive-shadow
readOnly: false
- mountPath: /etc/kube-state-metrics
name: kube-state-metrics-custom-resource-state-configmap
readOnly: true
- args:
- --secure-listen-address=:8443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- --upstream=http://127.0.0.1:8081/
- --tls-cert-file=/etc/tls/private/tls.crt
- --tls-private-key-file=/etc/tls/private/tls.key
- --config-file=/etc/kube-rbac-policy/config.yaml
image: quay.io/openshift/kube-rbac-proxy
name: kube-rbac-proxy-main
ports:
- containerPort: 8443
name: https-main
resources:
requests:
cpu: 1m
memory: 15Mi
securityContext: {}
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /etc/tls/private
name: kube-state-metrics-tls
readOnly: false
- mountPath: /etc/kube-rbac-policy
name: kube-state-metrics-kube-rbac-proxy-config
readOnly: true
- args:
- --secure-listen-address=:9443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- --upstream=http://127.0.0.1:8082/
- --tls-cert-file=/etc/tls/private/tls.crt
- --tls-private-key-file=/etc/tls/private/tls.key
- --config-file=/etc/kube-rbac-policy/config.yaml
image: quay.io/openshift/kube-rbac-proxy
name: kube-rbac-proxy-self
ports:
- containerPort: 9443
name: https-self
resources:
requests:
cpu: 1m
memory: 15Mi
securityContext: {}
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /etc/tls/private
name: kube-state-metrics-tls
readOnly: false
- mountPath: /etc/kube-rbac-policy
name: kube-state-metrics-kube-rbac-proxy-config
readOnly: true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add probes and resource limits for all three containers.

None of the containers defines liveness/readiness probes or cpu/memory limits. That means a hung exporter/proxy can stay in service indefinitely, and resource growth is left unbounded.

As per coding guidelines, "Resource limits (cpu, memory) on every container" and "Liveness + readiness probes defined".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/kube-state-metrics/03-deployment.yaml` around lines 35 - 126,
Add liveness and readiness probes plus CPU/memory limits for each container: for
the kube-state-metrics container (name: kube-state-metrics) add an HTTPGet
liveness and readiness probe targeting the exporter on localhost:8081 (or the
telemetry port 8082 if appropriate) with sensible initialDelay/period/timeout,
and add resources.limits.cpu and resources.limits.memory alongside the existing
requests; for the kube-rbac-proxy-main and kube-rbac-proxy-self containers
(names: kube-rbac-proxy-main and kube-rbac-proxy-self) add liveness/readiness
probes (HTTPGet or TCP) pointing at their container ports 8443 and 9443
respectively (use /healthz or /readyz or /metrics as appropriate) with sensible
timing values, and add matching resources.limits.cpu and resources.limits.memory
for each container to bound resource usage. Ensure probe fields and limits are
added under each container spec where args, image, name, and resources exist.

Comment on lines +64 to +68
resources:
requests:
cpu: 2m
memory: 80Mi
securityContext: {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit container hardening here.

All three containers and the pod use empty securityContexts, so this manifest is missing the required non-root, no-privilege-escalation, dropped-capabilities, read-only-rootfs, and seccomp settings. Relying on SCC defaults leaves this optional component with a weaker security contract than the repo guidelines require.

As per coding guidelines, "securityContext: runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation: false" and "Drop ALL capabilities, add only what is required".

Also applies to: 89-93, 114-118, 130-130

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/kube-state-metrics/03-deployment.yaml` around lines 64 - 68,
Update the pod-level securityContext and each container's securityContext in the
Deployment manifest (the pod spec and the three container blocks shown around
the empty securityContext entries) to enforce non-root and hardened settings:
set pod.securityContext.runAsNonRoot: true and a non-root runAsUser (e.g., 1000)
if not already set; for each container set securityContext.runAsNonRoot: true,
runAsUser consistent with the pod, allowPrivilegeEscalation: false,
readOnlyRootFilesystem: true, privileged: false, and capabilities.drop: ["ALL"]
(and add only any minimal capabilities required by that specific container);
also add a seccompProfile: { type: RuntimeDefault } at the pod or container
level. Ensure these changes are applied to the three container blocks and the
pod-level securityContext referred to in the diff.

Comment on lines +65 to +69
resources:
requests:
cpu: 8m
memory: 32Mi
securityContext: {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden the main exporter container too.

node-exporter is the only long-lived container here with an empty securityContext and no resource limits, even though it is the container with direct host mounts. That leaves the most exposed container dependent on defaults instead of an explicit manifest contract.

As per coding guidelines, "securityContext: runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation: false", "Drop ALL capabilities, add only what is required", and "Resource limits (cpu, memory) on every container".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/node-exporter/03-daemonset.yaml` around lines 65 - 69, The
node-exporter container currently has an empty securityContext and no resource
limits; update the container spec for "node-exporter" to explicitly set
securityContext: runAsNonRoot: true, readOnlyRootFilesystem: true,
allowPrivilegeEscalation: false, and drop all capabilities (capabilities: {
drop: ["ALL"] }) then add back only the minimal required capability if needed
(e.g., NET_RAW) via add. Also add resource limits to match requests (set
resources.limits.cpu and resources.limits.memory alongside the existing
resources.requests.cpu and resources.requests.memory). Ensure these changes are
applied to the container entry named "node-exporter" in the DaemonSet manifest.

Comment thread assets/optional/node-exporter/03-daemonset.yaml
name: node-exporter-kube-rbac-proxy-config
readOnly: true
hostNetwork: true
hostPID: true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop hostPID unless a collector really needs the host PID namespace.

The exporter is already pointed at /host/root/proc and /host/sys, so the host PID namespace does not appear necessary for the behavior shown here. Keeping it enabled expands process visibility for no clear benefit.

As per coding guidelines, "No hostPID, hostNetwork, hostIPC, privileged: true".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/node-exporter/03-daemonset.yaml` at line 129, The DaemonSet
sets hostPID: true which unnecessarily exposes the host PID namespace; remove
the hostPID: true line from the DaemonSet spec to follow the guideline "No
hostPID, hostNetwork, hostIPC, privileged: true", and verify the exporter still
works using the existing mounts (/host/root/proc and /host/sys); only
reintroduce hostPID if a specific collector explicitly requires host PID
visibility and document that requirement next to the hostPID setting.

Comment on lines +144 to +146
securityContext:
privileged: true
runAsUser: 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The init container is over-privileged for the mounts shown here.

init-textfile only gets the textfile emptyDir plus a read-only /var/log/wtmp mount, but it still runs privileged as UID 0. That is a much broader permission set than the manifest visibly requires.

As per coding guidelines, "No hostPID, hostNetwork, hostIPC, privileged: true" and "securityContext: runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation: false".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/node-exporter/03-daemonset.yaml` around lines 144 - 146, The
init container "init-textfile" is unnecessarily running privileged and as root;
update its securityContext to remove privileged: true and runAsUser: 0, and
instead set runAsNonRoot: true, runAsUser to a non-root UID (e.g., 1000) or
remove runAsUser so the image defaults to non-root, set readOnlyRootFilesystem:
true, and allowPrivilegeEscalation: false; ensure no hostPID/hostNetwork/hostIPC
are enabled for that container and retain only the minimal mounts (the emptyDir
textfile and the read-only /var/log/wtmp) so the init container is
least-privileged.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
assets/optional/kube-state-metrics/03-deployment.yaml (1)

96-98: ⚡ Quick win

TLS secret mount should be read-only.

The kube-state-metrics-tls secret volume is mounted with readOnly: false. Secrets should be mounted read-only unless modification is required.

Suggested fix
        volumeMounts:
        - mountPath: /etc/tls/private
          name: kube-state-metrics-tls
-          readOnly: false
+          readOnly: true

Also applies to line 121-123.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/kube-state-metrics/03-deployment.yaml` around lines 96 - 98,
The TLS secret volume for kube-state-metrics is mounted with readOnly: false;
change the volumeMount for name: kube-state-metrics-tls (mountPath:
/etc/tls/private) to readOnly: true, and make the same change for the other
kube-state-metrics-tls mount instance referenced later in the file so both
secret mounts are read-only.
assets/optional/node-exporter/03-daemonset.yaml (1)

121-124: ⚡ Quick win

TLS secret mount should be read-only.

Same as kube-state-metrics: the TLS secret is mounted with readOnly: false.

Suggested fix
        volumeMounts:
        - mountPath: /etc/tls/private
          name: node-exporter-tls
-          readOnly: false
+          readOnly: true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/optional/node-exporter/03-daemonset.yaml` around lines 121 - 124, The
TLS secret volumeMount for the node-exporter container is currently writable;
change the volumeMount entry named "node-exporter-tls" (mountPath
"/etc/tls/private") to be readOnly: true so the secret cannot be modified at
runtime; update the volumeMount under the container spec where "volumeMounts"
contains name: node-exporter-tls and mountPath: /etc/tls/private to set
readOnly: true.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@assets/optional/kube-state-metrics/03-deployment.yaml`:
- Around line 96-98: The TLS secret volume for kube-state-metrics is mounted
with readOnly: false; change the volumeMount for name: kube-state-metrics-tls
(mountPath: /etc/tls/private) to readOnly: true, and make the same change for
the other kube-state-metrics-tls mount instance referenced later in the file so
both secret mounts are read-only.

In `@assets/optional/node-exporter/03-daemonset.yaml`:
- Around line 121-124: The TLS secret volumeMount for the node-exporter
container is currently writable; change the volumeMount entry named
"node-exporter-tls" (mountPath "/etc/tls/private") to be readOnly: true so the
secret cannot be modified at runtime; update the volumeMount under the container
spec where "volumeMounts" contains name: node-exporter-tls and mountPath:
/etc/tls/private to set readOnly: true.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 79ac37c5-0900-42aa-b000-f7209d48285c

📥 Commits

Reviewing files that changed from the base of the PR and between 5843fac and e53c5d4.

📒 Files selected for processing (54)
  • assets/components/metrics-server/kubelet-ca-configmap.yaml
  • assets/components/metrics-server/kubelet-client-secret.yaml
  • assets/optional/kube-state-metrics/01-cluster-role-binding.yaml
  • assets/optional/kube-state-metrics/01-cluster-role.yaml
  • assets/optional/kube-state-metrics/01-service-account.yaml
  • assets/optional/kube-state-metrics/02-custom-resource-state-configmap.yaml
  • assets/optional/kube-state-metrics/02-kube-rbac-proxy-secret.yaml
  • assets/optional/kube-state-metrics/03-deployment.yaml
  • assets/optional/kube-state-metrics/04-service.yaml
  • assets/optional/kube-state-metrics/kustomization.aarch64.yaml
  • assets/optional/kube-state-metrics/kustomization.x86_64.yaml
  • assets/optional/kube-state-metrics/kustomization.yaml
  • assets/optional/metrics-server/00-namespace.yaml
  • assets/optional/metrics-server/01-cluster-role-binding-auth-delegator.yaml
  • assets/optional/metrics-server/01-cluster-role-binding.yaml
  • assets/optional/metrics-server/01-cluster-role.yaml
  • assets/optional/metrics-server/01-role-binding-auth-reader.yaml
  • assets/optional/metrics-server/01-service-account.yaml
  • assets/optional/metrics-server/02-configmap-audit-profiles.yaml
  • assets/optional/metrics-server/03-deployment.yaml
  • assets/optional/metrics-server/04-api-service.yaml
  • assets/optional/metrics-server/04-service.yaml
  • assets/optional/metrics-server/kustomization.aarch64.yaml
  • assets/optional/metrics-server/kustomization.x86_64.yaml
  • assets/optional/metrics-server/kustomization.yaml
  • assets/optional/metrics-server/network-policy-downstream.yaml
  • assets/optional/metrics-server/pod-disruption-budget.yaml
  • assets/optional/metrics-server/release-metrics-aarch64.json
  • assets/optional/metrics-server/release-metrics-x86_64.json
  • assets/optional/node-exporter/01-cluster-role-binding.yaml
  • assets/optional/node-exporter/01-cluster-role.yaml
  • assets/optional/node-exporter/01-security-context-constraints.yaml
  • assets/optional/node-exporter/01-service-account.yaml
  • assets/optional/node-exporter/02-accelerators-collector-configmap.yaml
  • assets/optional/node-exporter/02-kube-rbac-proxy-secret.yaml
  • assets/optional/node-exporter/03-daemonset.yaml
  • assets/optional/node-exporter/04-service.yaml
  • assets/optional/node-exporter/kustomization.aarch64.yaml
  • assets/optional/node-exporter/kustomization.x86_64.yaml
  • assets/optional/node-exporter/kustomization.yaml
  • packaging/observability/microshift-observability.service
  • packaging/observability/otelcol.d/microshift-metrics.yaml
  • packaging/rpm/microshift.spec
  • pkg/cmd/init.go
  • pkg/cmd/metrics.go
  • pkg/cmd/run.go
  • pkg/healthcheck/microshift_optional_workloads.go
  • pkg/util/cryptomaterial/certinfo.go
  • scripts/auto-rebase/assets_metrics.yaml
  • scripts/auto-rebase/rebase.sh
  • scripts/devenv-builder/config/kickstart.ks.template
  • scripts/devenv-builder/create-vm.sh
  • scripts/devenv-builder/manage-vm.sh
  • test/bin/common.sh
✅ Files skipped from review due to trivial changes (8)
  • assets/optional/node-exporter/kustomization.aarch64.yaml
  • assets/optional/metrics-server/release-metrics-x86_64.json
  • assets/optional/metrics-server/01-cluster-role-binding-auth-delegator.yaml
  • assets/optional/metrics-server/04-api-service.yaml
  • assets/components/metrics-server/kubelet-ca-configmap.yaml
  • assets/optional/metrics-server/02-configmap-audit-profiles.yaml
  • assets/optional/node-exporter/kustomization.x86_64.yaml
  • assets/optional/kube-state-metrics/kustomization.x86_64.yaml
🚧 Files skipped from review as they are similar to previous changes (39)
  • assets/components/metrics-server/kubelet-client-secret.yaml
  • pkg/cmd/init.go
  • pkg/cmd/run.go
  • assets/optional/metrics-server/01-cluster-role-binding.yaml
  • assets/optional/metrics-server/kustomization.x86_64.yaml
  • assets/optional/node-exporter/01-security-context-constraints.yaml
  • assets/optional/metrics-server/00-namespace.yaml
  • assets/optional/metrics-server/01-service-account.yaml
  • assets/optional/kube-state-metrics/02-kube-rbac-proxy-secret.yaml
  • assets/optional/node-exporter/01-service-account.yaml
  • assets/optional/kube-state-metrics/01-cluster-role-binding.yaml
  • assets/optional/node-exporter/02-accelerators-collector-configmap.yaml
  • assets/optional/node-exporter/04-service.yaml
  • assets/optional/metrics-server/01-cluster-role.yaml
  • assets/optional/kube-state-metrics/04-service.yaml
  • assets/optional/metrics-server/network-policy-downstream.yaml
  • assets/optional/kube-state-metrics/01-service-account.yaml
  • assets/optional/metrics-server/04-service.yaml
  • assets/optional/kube-state-metrics/kustomization.aarch64.yaml
  • assets/optional/kube-state-metrics/kustomization.yaml
  • packaging/observability/otelcol.d/microshift-metrics.yaml
  • assets/optional/metrics-server/kustomization.yaml
  • scripts/devenv-builder/manage-vm.sh
  • assets/optional/node-exporter/02-kube-rbac-proxy-secret.yaml
  • assets/optional/kube-state-metrics/01-cluster-role.yaml
  • assets/optional/metrics-server/03-deployment.yaml
  • packaging/observability/microshift-observability.service
  • pkg/util/cryptomaterial/certinfo.go
  • assets/optional/node-exporter/kustomization.yaml
  • pkg/cmd/metrics.go
  • assets/optional/metrics-server/pod-disruption-budget.yaml
  • scripts/auto-rebase/assets_metrics.yaml
  • assets/optional/node-exporter/01-cluster-role.yaml
  • assets/optional/metrics-server/kustomization.aarch64.yaml
  • assets/optional/node-exporter/01-cluster-role-binding.yaml
  • pkg/healthcheck/microshift_optional_workloads.go
  • assets/optional/kube-state-metrics/02-custom-resource-state-configmap.yaml
  • packaging/rpm/microshift.spec
  • scripts/auto-rebase/rebase.sh

Split the single microshift-metrics RPM into three independently
installable packages to satisfy the USHIFT-6890 requirement that
each exporter be enabled individually:

- microshift-metrics-server
- microshift-metrics-kube-state-metrics
- microshift-metrics-node-exporter

All three can be installed together with: dnf install microshift-metrics-*

Split the combined otel-collector drop-in config into per-package
configs so each RPM ships its own scrape configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants